Load the data set you exported in the final Task of Case Study 2. Eliminate all observations with missing values in the income status variable.
As a reminder, the data set includes world data from 2020, focusing on education expenditure, youth unemployment rate and net migration rate for most world entities. The data was downloaded from https://www.cia.gov/the-world-factbook/about/archives/. Additional information on continent, subcontinent/region and income status was appended to the dataset in Case Study 2.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats)
## Warning: package 'forcats' was built under R version 4.4.3
Using ggplot2, create a density plot of the education expenditure grouped by income status. The densities for the different groups are superimposed in the same plot rather than in different plots. Ensure that you order the levels of the income status such that in the plots the legend is ordered from High (H) to Low (L).
The color of the density lines is black.
The area under the density curve should be colored differently among the income status levels.
For the colors, choose a transparency level of 0.5 for better visibility.
Position the legend at the top center of the plot and give it no
title (hint: use element_blank()).
Rename the x axis as “Education expenditure (% of GDP)”
Comment briefly on the plot.
data_case_study2 <- read.csv("world_data_2020_tidy.csv", header = TRUE, sep = ";")
head(data_case_study2)
## country.x iso_code.x continent.x subcontinent.x net_migration_rate
## 1 Afghanistan AFG Asia Southern Asia -0,1
## 2 Albania ALB Europe Southern Europe -3,3
## 3 Algeria DZA Africa Northern Africa -0,9
## 4 American Samoa ASM Oceania Polynesia -26,1
## 5 Andorra AND Europe Southern Europe 0
## 6 Angola AGO Africa Sub-Saharan Africa -0,2
## youth_unempl_rate education_expenditure income_group
## 1 17,6 4,1 Low income
## 2 31,9 3,6 Upper middle income
## 3 39,3 . Upper middle income
## 4 . . High income
## 5 . 3,2 High income
## 6 39,4 3,4 Lower middle income
nrow(data_case_study2)
## [1] 227
data_case_study2 <- data_case_study2[data_case_study2$income_group != ".", ]
nrow(data_case_study2)
## [1] 212
# transforming variables to numeric values
data_case_study2$education_expenditure <- as.numeric(gsub(",", ".", data_case_study2$education_expenditure))
## Warning: NAs introduced by coercion
data_case_study2$net_migration_rate <- as.numeric(gsub(",", ".", data_case_study2$net_migration_rate))
data_case_study2$youth_unempl_rate <- as.numeric(gsub(",", ".", data_case_study2$youth_unempl_rate))
## Warning: NAs introduced by coercion
nrow(data_case_study2)
## [1] 212
# ordering income groups like in task
data_case_study2$income_group <- factor(
data_case_study2[["income_group"]],
levels = c("High income", "Upper middle income", "Lower middle income", "Low income")
)
# density plot
ggplot(data_case_study2, aes(x = education_expenditure, fill = income_group)) +
  geom_density(color = "black", alpha = 0.5) +
  scale_fill_discrete(name = NULL) +
  labs(x = "Education expenditure (% of GDP)") +
  theme(
    legend.position = "top",
    legend.justification = "center",
    legend.title = element_blank()
  )
## Warning: Removed 47 rows containing non-finite outside the scale range
## (`stat_density()`).
Analyzing the plot, we can see that there are more countries in the groups “High income” and “Upper middle income”. Further, countries in the groups “Lower middle income” and “Upper middle income” spend the highest share of their GDP on education. This is plausible, as education is a basic need.
We can also see that only a few countries in the groups “High income” and “Upper middle income” spend less than 2.5% of their GDP on education, whereas many countries in the groups “Lower middle income” and “Low income” do.
Investigate how the income status is distributed in the different continents.
Using ggplot2, create a stacked barplot of absolute frequencies showing how the entities are split into continents and income status. Comment on the plot.
Create another stacked barplot of relative frequencies (height of the bars should be one). Comment on the plot.
Create a mosaic plot of continents and income status using base R functions.
Briefly comment on the differences between the three plots generated to investigate the income distribution among the different continents.
# stacked barplot with absolute frequencies, legend placed at the top
ggplot(data_case_study2, aes(x = continent.x, fill = income_group)) +
geom_bar(position = "stack") +
labs(x = "Continent", y = "Number of Countries", fill = "Income Group") +
theme_minimal() +
theme(legend.position = "top")
On the stacked barplot with absolute frequencies we can see how many
countries exist in each income group on each continent.
# stacked barplot with relative frequencies, legend placed at the top
ggplot(data_case_study2, aes(x = continent.x, fill = income_group)) +
geom_bar(position = "fill") +
labs(x = "Continent", y = "Proportion", fill = "Income Group") +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
theme(legend.position = "top")
On the stacked barplot with relative frequencies we can see what the
proportion of countries in each income group on each continent is.
# contingency table for mosaic plot
tbl <- table(data_case_study2$continent.x, data_case_study2$income_group)
# mosaic plot
mosaicplot(tbl, main = "Income Group by Continent", xlab = "Continent", ylab = "Income Group", color = TRUE, las = 1)
In the mosaic plot we can see how the countries on each continent are split between the income groups.
The first plot shows the absolute number of countries in each income group. This plot is especially effective when a question has to be answered in terms of absolute numbers.
The second plot shows the same information in relative terms. It allows a fast and intuitive comparison of the income group distribution across continents, without having to work out the number of countries mentally. Each continent can be compared very effectively with all other continents.
The mosaic plot also shows the relative distribution of the countries of a continent across income groups. Its main message and design are similar to those of the stacked barplot of relative frequencies. However, two differences lead us to reject this type of plot in the following task. First, the income groups are no longer encoded by color with a legend but appear as labels along an axis, and no percentage scale is shown, so the reader has to estimate the relative frequencies for each continent from the tile sizes. Second, the labels of one axis are drawn at the top of the plot instead of at the bottom, which is an unusual position.
For Oceania, investigate further how the income status distribution is in the different subcontinents. Use one of the plots in b. for this purpose. Comment on the results.
# creating oceania dataset
oceania_data <- data_case_study2 %>%
filter(continent.x == "Oceania", !is.na(subcontinent.x), !is.na(income_group))
# ordering income group levels
oceania_data$income_group <- factor(oceania_data$income_group,
levels = c("High income", "Upper middle income",
"Lower middle income", "Low income"))
# plotting relative frequencies of income group distribution
ggplot(oceania_data, aes(x = subcontinent.x, fill = income_group)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
labs(x = "Subcontinent",
y = "Proportion of Countries",
fill = "Income Group",
title = "Income Status Distribution in Oceania by Subcontinent") +
theme_minimal() +
theme(legend.position = "top",
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))
We chose the stacked barplot of relative frequencies because it makes it easy to compare the distribution of income groups across subcontinents. Looking at Oceania, we can see that the subcontinent Australia and New Zealand contains only countries that belong to the group “High income”.
The income group distributions of the other three subcontinents can easily be compared with each other in this type of plot. Among these three subcontinents, Micronesia has the highest proportion of high income countries and Melanesia the lowest.
## d. Net migration in different continents
Using ggplot2, create parallel boxplots showing the distribution of the net migration rate in the different continents.
Prettify the plot (change y-, x-axis labels, etc).
Identify which country in Asia constitutes the largest negative outlier and which country in Asia constitutes the largest positive outlier.
Comment on the plot.
data_box <- data_case_study2 %>%
filter(!is.na(continent.x), !is.na(net_migration_rate))
# boxplot by continent
ggplot(data_box, aes(x = continent.x, y = net_migration_rate)) +
geom_boxplot(outlier.color = "red") +
labs(
title = "Distribution of net migration rate by continents",
x = "Continent",
y = "Net migration rate per 1000 population"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold", hjust = 0.5)
)
asia_data <- data_box %>% filter(continent.x == "Asia")
# calc iqr ranges
q1 <- quantile(asia_data$net_migration_rate, 0.25, na.rm = TRUE)
q3 <- quantile(asia_data$net_migration_rate, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
# investigate outliers
asia_outliers <- asia_data %>%
filter(net_migration_rate < lower_bound | net_migration_rate > upper_bound)
# biggest positive and negative outlier
asia_outliers %>%
arrange(net_migration_rate) %>%
select(country.x, net_migration_rate) %>%
slice(c(1, n()))
## country.x net_migration_rate
## 1 Lebanon -88.7
## 2 Syria 27.1
Comparing the boxplots with each other, we can see that Oceania has the largest share of people migrating away from the continent. Asia has the most widely spread outliers in both the positive and the negative direction of the net migration rate.
We identified the largest positive and the largest negative outlier in Asia. The largest negative outlier is Lebanon (-88.7 per 1,000 population), which suggests that many more people are leaving Lebanon than arriving. The largest positive outlier is Syria (27.1 per 1,000 population), which suggests that more people are migrating to Syria than leaving it.
The graph in d. clearly does not convey the whole picture. It would be interesting also to look at the subcontinents, as it is likely that a lot of migration flows happen within the continent.
Investigate the net migration in different subcontinents using
again parallel boxplots. Group the boxplots by continent (hint: use
facet_grid with scales = "free_x").
Remember to prettify the plot (rotate axis labels if needed).
Describe what you see.
migration_data <- data_case_study2 %>%
filter(!is.na(continent.x), !is.na(subcontinent.x), !is.na(net_migration_rate))
# boxplot by subcontinent
ggplot(migration_data, aes(x = subcontinent.x, y = net_migration_rate)) +
geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red") +
facet_grid(. ~ continent.x, scales = "free_x", space = "free_x") +
labs(
title = "Net migration rate by subcontinent (grouped by continent)",
x = "Subcontinent",
y = "Net migration rate (per 1,000 population)"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
strip.text.x = element_text(face = "bold"),
plot.title = element_text(face = "bold", hjust = 0.5)
)
In the boxplots of the net migration rate by subcontinent, the Oceanian subcontinents suggest a regional flow, in particular from Micronesia and Polynesia (the lowest net migration rates) to Australia and New Zealand (the highest net migration rate, both within Oceania and overall). However, people from Micronesia and Polynesia could also be moving to other continents, while people from other continents could be moving to Australia and New Zealand. To support this hypothesis, we would need data on the origins and destinations of these movements.
Further, we can see that Polynesia, Micronesia and Western Asia show large variation, as indicated by their wide interquartile ranges.
We can also see that Latin America and the Caribbean and Sub-Saharan Africa have the most outliers. These outliers are not extreme, but lie at similar distances from the median in both the positive and the negative direction.
The plot in task e. shows the distribution of the net migration rate for each subcontinent. Here you will work on visualizing only one summary statistic, namely the median.
For each subcontinent, calculate the median net migration rate. Then create a plot which contains the sub-regions on the y-axis and the median net migration rate on the x-axis.
As geoms use points.
Color the points by continent – use a colorblind friendly palette (see e.g., here).
Rename the axes.
Using fct_reorder from the forcats
package, arrange the levels of subcontinent such that in the plot the
lowest (bottom) subcontinent contains the lowest median net migration
rate and the upper most region contains the highest median net migration
rate.
Comment on the plot. E.g., what are the regions with the most influx? What are the regions with the most outflux?
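The steps of this task can be sketched as follows. This is a minimal example on hypothetical toy data (the data frame `toy` and its values are made up for illustration); in the case study the same pipeline would be applied to `data_case_study2`. The Okabe-Ito colors are one possible colorblind-friendly palette, not necessarily the one linked in the task.

```r
library(ggplot2)
library(dplyr)
library(forcats)

# Hypothetical toy data; in the case study this would be data_case_study2.
toy <- data.frame(
  continent.x        = c("Asia", "Asia", "Europe", "Europe", "Oceania", "Oceania"),
  subcontinent.x     = c("Western Asia", "Western Asia", "Northern Europe",
                         "Northern Europe", "Polynesia", "Polynesia"),
  net_migration_rate = c(-4.0, 2.0, 1.5, 3.5, -20.0, -10.0)
)

# One row per subcontinent with its median net migration rate.
median_migration <- toy %>%
  filter(!is.na(net_migration_rate)) %>%
  group_by(continent.x, subcontinent.x) %>%
  summarise(median_migration = median(net_migration_rate), .groups = "drop") %>%
  # Order the factor levels by the median, so the lowest median is drawn at the bottom.
  mutate(subcontinent.x = fct_reorder(subcontinent.x, median_migration))

p_median <- ggplot(median_migration,
                   aes(x = median_migration, y = subcontinent.x, color = continent.x)) +
  geom_point(size = 3) +
  # Okabe-Ito palette (assumption: any colorblind-friendly palette would do).
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73",
                                "#F0E442", "#0072B2", "#D55E00")) +
  labs(x = "Median net migration rate (per 1,000 population)",
       y = "Subcontinent", color = "Continent") +
  theme_minimal()
p_median
```

Because `fct_reorder()` sorts the levels in increasing order of the median, the subcontinent with the strongest outflux ends up at the bottom of the y-axis and the one with the strongest influx at the top.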
For each subcontinent, calculate the median youth unemployment rate. Then create a plot which contains the sub-regions on the y-axis and the median unemployment rate on the x-axis.
Use a black and white theme (?theme_bw())
As geoms use bars. (hint: pay attention to the statistical
transformation taking place in geom_bar() – look into
argument stat="identity")
Color the bars by continent – use a colorblind friendly palette.
Make the bars transparent (use
alpha = 0.7).
Rename the axes.
Using fct_reorder from the forcats
package, arrange the levels of subcontinent such that in the plot the
lowest (bottom) subcontinent contains the lowest median youth
unemployment rate and the upper most region contains the highest median
youth unemployment rate.
Comment on the plot. E.g., what are the regions with the highest
vs lowest youth unemployment rate?
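A possible sketch of this barplot is given below, again on hypothetical toy data (`toy` and its values are assumptions for illustration only). Since the medians are computed beforehand, `geom_bar()` must not count rows, which is why `stat = "identity"` is used.

```r
library(ggplot2)
library(dplyr)
library(forcats)

# Hypothetical toy data; in the case study this would be data_case_study2.
toy <- data.frame(
  continent.x       = c("Africa", "Africa", "Europe", "Europe", "Asia", "Asia"),
  subcontinent.x    = c("Northern Africa", "Northern Africa", "Southern Europe",
                        "Southern Europe", "Eastern Asia", "Eastern Asia"),
  youth_unempl_rate = c(30, 26, 32, 20, 8, NA)
)

# Median youth unemployment rate per subcontinent, ignoring missing values.
median_unempl <- toy %>%
  group_by(continent.x, subcontinent.x) %>%
  summarise(median_unempl = median(youth_unempl_rate, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(subcontinent.x = fct_reorder(subcontinent.x, median_unempl))

p_bars <- ggplot(median_unempl,
                 aes(x = median_unempl, y = subcontinent.x, fill = continent.x)) +
  # stat = "identity": bar length is the precomputed median, not a row count.
  geom_bar(stat = "identity", alpha = 0.7) +
  scale_fill_manual(values = c("#E69F00", "#56B4E9", "#009E73",
                               "#F0E442", "#0072B2", "#D55E00")) +
  labs(x = "Median youth unemployment rate (%)",
       y = "Subcontinent", fill = "Continent") +
  theme_bw()
p_bars
```

`geom_col()` would be an equivalent shorthand for `geom_bar(stat = "identity")`.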
The value displayed in the barplot in g. is the result of an aggregation, so it might be useful to also plot error bars, to have a general idea on how precise the median unemployment is. This can be achieved by plotting the error bars which reflect the standard deviation or the interquartile range of the variable in each of the subcontinents.
Repeat the plot from Task g. but include also error bars which
reflect the 25% and 75% quantiles. You can use
geom_errorbar in ggplot2.
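The error bars can be added by computing the 25% and 75% quantiles alongside the median and mapping them to `xmin` and `xmax`. The following is a minimal sketch on hypothetical toy data (all values are made up); it assumes a ggplot2 version in which `geom_errorbar()` accepts horizontal bars via `xmin`/`xmax` (3.3.0 or later).

```r
library(ggplot2)
library(dplyr)
library(forcats)

# Hypothetical toy data; in the case study this would be data_case_study2.
toy <- data.frame(
  continent.x       = c(rep("Africa", 4), rep("Asia", 3)),
  subcontinent.x    = c(rep("Northern Africa", 4), rep("Eastern Asia", 3)),
  youth_unempl_rate = c(10, 20, 30, 40, 5, 7, 9)
)

# Median plus lower and upper quartile per subcontinent.
summary_unempl <- toy %>%
  filter(!is.na(youth_unempl_rate)) %>%
  group_by(continent.x, subcontinent.x) %>%
  summarise(
    q25           = quantile(youth_unempl_rate, 0.25, names = FALSE),
    median_unempl = median(youth_unempl_rate),
    q75           = quantile(youth_unempl_rate, 0.75, names = FALSE),
    .groups       = "drop"
  ) %>%
  mutate(subcontinent.x = fct_reorder(subcontinent.x, median_unempl))

p_error <- ggplot(summary_unempl,
                  aes(x = median_unempl, y = subcontinent.x, fill = continent.x)) +
  geom_bar(stat = "identity", alpha = 0.7) +
  # Horizontal error bars spanning the interquartile range.
  geom_errorbar(aes(xmin = q25, xmax = q75), width = 0.3) +
  labs(x = "Median youth unemployment rate (%), bars show 25% and 75% quantiles",
       y = "Subcontinent", fill = "Continent") +
  theme_bw()
p_error
```

Using quantiles rather than the standard deviation keeps the error bars consistent with the median, since both are robust to outliers.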
Using ggplot2, create a plot showing the relationship between education expenditure and net migration rate.
Color the geoms based on the income status.
Add a regression line for each development status (using
geom_smooth()).
Comment on the plot. Do you see any relationship between the two variables? Do you see any difference among the income levels?
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 47 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 47 rows containing missing values or values outside the scale range
## (`geom_point()`).
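A plot of this kind can be sketched as follows. The example uses randomly generated toy data (an assumption for illustration, not the case-study data) and assumes a linear fit (`method = "lm"`) for the regression lines; other smoothing methods would work the same way.

```r
library(ggplot2)

# Hypothetical toy data; in the case study this would be data_case_study2.
set.seed(1)
toy <- data.frame(
  education_expenditure = runif(40, min = 1, max = 8),
  net_migration_rate    = rnorm(40, mean = 0, sd = 5),
  income_group          = factor(rep(c("High income", "Low income"), each = 20))
)
toy$education_expenditure[c(3, 25)] <- NA  # mimic missing values

p_scatter <- ggplot(toy, aes(x = education_expenditure, y = net_migration_rate,
                             color = income_group)) +
  geom_point(alpha = 0.7, na.rm = TRUE) +
  # Mapping color in the global aes() gives one regression line per income group.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  labs(x = "Education expenditure (% of GDP)",
       y = "Net migration rate (per 1,000 population)",
       color = "Income group") +
  theme_minimal()
p_scatter
```

Setting `na.rm = TRUE` in the layers drops incomplete rows silently instead of emitting the "Removed ... rows" warnings.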
## j. Relationship between youth unemployment and net migration rate
Create a plot as in Task i. but for youth unemployment and net migration rate. Comment briefly.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 37 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).
Go online and find a data set which contains the 2020 population for the countries of the world together with ISO codes.
Download this data and merge it to the dataset you are working on in this case study using a left join. (A possible source: World Bank)
Inspect the data and check whether the join worked well.
First, we simply read in the dataset from the csv file.
## country.x iso_code.x continent.x subcontinent.x net_migration_rate
## 1 Afghanistan AFG Asia Southern Asia -0.1
## 2 Albania ALB Europe Southern Europe -3.3
## 3 Algeria DZA Africa Northern Africa -0.9
## 4 American Samoa ASM Oceania Polynesia -26.1
## 5 Andorra AND Europe Southern Europe 0.0
## 6 Angola AGO Africa Sub-Saharan Africa -0.2
## youth_unempl_rate education_expenditure income_group population_2020
## 1 17.6 4.1 Low income 39068979
## 2 31.9 3.6 Upper middle income 2837849
## 3 39.3 NA Upper middle income 44042091
## 4 NA NA High income 49761
## 5 NA 3.2 High income 77380
## 6 39.4 3.4 Lower middle income 33451132
Now, let's check whether everything worked well.
## country.x iso_code.x continent.x subcontinent.x net_migration_rate
## 1 Taiwan TWN Asia Eastern Asia 0.8
## youth_unempl_rate education_expenditure income_group population_2020
## 1 NA NA High income NA
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.040e+04 7.861e+05 6.727e+06 3.655e+07 2.468e+07 1.411e+09 1
A single NA value suggests that the left join worked well. Further, all population values lie in a reasonable range.
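The join and the checks described above can be sketched as follows. The population table `population` and its column name `iso_code` are hypothetical (the real download will have its own column names); only `iso_code.x` and `population_2020` are taken from the output shown above.

```r
library(dplyr)

# Small excerpt standing in for data_case_study2.
world <- data.frame(
  country.x          = c("Afghanistan", "Albania", "Taiwan"),
  iso_code.x         = c("AFG", "ALB", "TWN"),
  net_migration_rate = c(-0.1, -3.3, 0.8)
)

# Hypothetical population table; the column name iso_code is an assumption.
population <- data.frame(
  iso_code        = c("AFG", "ALB", "DZA"),
  population_2020 = c(39068979, 2837849, 44042091)
)

# Left join: keep every row of world, add the population where the ISO code matches.
merged <- world %>%
  left_join(population, by = c("iso_code.x" = "iso_code"))

# A left join must not change the number of rows unless ISO codes are duplicated.
stopifnot(nrow(merged) == nrow(world))

# Entities without a match, and the range of the population values.
merged %>% filter(is.na(population_2020))
summary(merged$population_2020)
```

In this toy example Taiwan stays unmatched, which mirrors the single NA in the output above.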
Make a scatterplot of education expenditure and net migration rate for the countries of Europe.
Scale the size of the points according to each country’s population.
For better visibility, use a transparency of
alpha=0.7.
Remove the legend.
Comment on the plot.
In the scatter plot, we notice that countries with large populations lie closer to the center. On the one hand, this means more consistent values for the net migration rate, which makes sense as it is measured per 1,000 population. On the other hand, education expenditure is relatively consistent for countries with large populations (around 4-6% of GDP), while countries with smaller populations show more variation.
The scatter plot shows a slight positive correlation between education expenditure and net migration rate. One could conclude that boosting education budgets may help attract or retain people. However, this trend is rather subtle in the data, and it may be heavily influenced by other factors, so it represents only a small part of a far more complex picture.
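A scatterplot with the properties listed in this task could be built as sketched below. The toy data frame and the country names are hypothetical; `scale_size_area()` is one possible choice for mapping population to point size.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for the merged data set from Task k.
toy <- data.frame(
  country.x             = c("Country A", "Country B", "Country C", "Country D"),
  continent.x           = c("Europe", "Europe", "Europe", "Asia"),
  education_expenditure = c(4.5, 5.2, 3.1, 2.0),
  net_migration_rate    = c(1.2, 3.4, -2.0, 0.5),
  population_2020       = c(8e7, 5e6, 4e5, 1e8)
)

europe <- toy %>% filter(continent.x == "Europe")

p_europe <- ggplot(europe, aes(x = education_expenditure, y = net_migration_rate,
                               size = population_2020)) +
  geom_point(alpha = 0.7) +
  # Point area (not radius) proportional to population.
  scale_size_area(max_size = 12) +
  labs(x = "Education expenditure (% of GDP)",
       y = "Net migration rate (per 1,000 population)") +
  theme_minimal() +
  theme(legend.position = "none")  # remove the legend
p_europe
```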
On the merged data set from Task k., using function
ggplotly from package plotly
re-create the scatterplot in Task l., but this time for all countries.
Color the points according to their continent.
When hovering over the points the name of the country, the values for
education expenditure, net migration rate, and population should be
shown. (Hint: use the aesthetic text = Country. In
ggplotly use the argument
tooltip = c("text", "x", "y", "size")).
## Warning: package 'plotly' was built under R version 4.4.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
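The interactive version can be sketched as follows, assuming the plotly package is installed. The toy data is hypothetical; the country column is called `country.x` here (as in the data set above), so it is used for the `text` aesthetic instead of the `Country` name mentioned in the hint.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data standing in for the merged data set from Task k.
toy <- data.frame(
  country.x             = c("Country A", "Country B", "Country C", "Country D"),
  continent.x           = c("Europe", "Europe", "Asia", "Africa"),
  education_expenditure = c(4.5, 5.2, 3.1, 2.0),
  net_migration_rate    = c(1.2, 3.4, -2.0, 0.5),
  population_2020       = c(8e7, 5e6, 4e5, 1e8)
)

# The text aesthetic is not drawn by ggplot2; it only feeds the plotly tooltip.
p_all <- ggplot(toy, aes(x = education_expenditure, y = net_migration_rate,
                         size = population_2020, color = continent.x,
                         text = country.x)) +
  geom_point(alpha = 0.7) +
  labs(x = "Education expenditure (% of GDP)",
       y = "Net migration rate (per 1,000 population)",
       color = "Continent") +
  theme_minimal()

# Convert the static plot; the tooltip shows country, x, y and population.
fig <- ggplotly(p_all, tooltip = c("text", "x", "y", "size"))
fig
```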
In parallel coordinate plots each observation or data point is depicted as a line traversing a series of parallel axes, corresponding to a specific variable or dimension. It is often used for identifying clusters in the data.
One can create such a plot using the GGally R package. You should create such a plot where you look at the three main variables in the data set: education expenditure, youth unemployment rate and net migration rate. Color the lines based on the income status. Briefly comment.
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
In the parallel coordinate plot, we colored the lines according to the income status of each country. However, as the lines of the different income groups overlap heavily, we cannot identify any obvious income-related patterns or clusters in the data.
Many of the lines slope downward from a high net_migration_rate to a low youth_unempl_rate, and upward from a low net_migration_rate to a high youth_unempl_rate. This hints at a negative correlation between the two variables. However, as a parallel coordinate plot is not ideal for assessing correlation (e.g., because each variable is scaled independently), this trend should be checked statistically.
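A parallel coordinate plot of this kind can be produced with `ggparcoord()` from GGally, as sketched below on hypothetical toy data. The choice `scale = "uniminmax"` (each variable rescaled to the range 0 to 1) is an assumption; the scaling used for the plot discussed above is not shown in the output.

```r
library(ggplot2)
library(GGally)

# Hypothetical toy data; in the case study this would be data_case_study2.
toy <- data.frame(
  education_expenditure = c(4.1, 3.6, NA, 3.2, 3.4, 5.0),
  youth_unempl_rate     = c(17.6, 31.9, 39.3, 12.0, 39.4, 9.5),
  net_migration_rate    = c(-0.1, -3.3, -0.9, 0.0, -0.2, 4.2),
  income_group          = factor(c("Low income", "Upper middle income",
                                   "Upper middle income", "High income",
                                   "Lower middle income", "High income"))
)

# Keep only complete rows so that every line crosses all three axes.
toy_complete <- toy[complete.cases(toy), ]

p_parcoord <- ggparcoord(
  toy_complete,
  columns     = 1:3,             # the three numeric variables
  groupColumn = "income_group",  # color the lines by income status
  scale       = "uniminmax",     # rescale each variable to [0, 1]
  alphaLines  = 0.5
) +
  labs(x = "Variable", y = "Scaled value", color = "Income group") +
  theme_minimal()
p_parcoord
```

Because every axis is scaled on its own, the slopes between neighboring axes only indicate a tendency; a correlation coefficient, e.g. from `cor()`, would be needed to confirm it.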
Create a world map of the education expenditure per country. You can use the vignette https://cran.r-project.org/web/packages/rworldmap/vignettes/rworldmap.pdf to find out how to do this in R. Alternatively, you can use other packages (such as ggplot2, sf and rnaturalearthdata) to create a map.
## Warning: package 'rworldmap' was built under R version 4.4.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 4.4.3
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
## 211 codes from your data successfully matched countries in the map
## 1 codes from your data failed to match with a country code in the map
## 32 codes from the map weren't represented in your data
The plot shows the education expenditure of each country on a world map, with low expenditures colored yellowish and high expenditures colored reddish.
We notice some expected trends: many African, Middle Eastern and South Asian countries have low education expenditures, while most countries in Europe and North America tend to invest more.
However, some trends are also surprising: Sub-Saharan Africa is mixed, with many south-eastern countries having high education expenditures, while many north-western countries tend to have lower ones. Further, most countries in South America, e.g. Brazil, have high education expenditures, although they are known as developing countries that still face many social and economic issues. One possible explanation could be that the education system is not well developed, but the government still spends a lot of money on it.
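With rworldmap, a map of this kind is typically built in two steps: joining the data to the map by ISO code and then plotting one column. The sketch below uses four rows taken from the head of the data set shown earlier; the map title, the `"fixedWidth"` category method and the `"heat"` palette are assumptions chosen for illustration.

```r
library(rworldmap)

# Four rows from the head of the data set; the full data would be used in practice.
toy <- data.frame(
  iso_code.x            = c("AFG", "ALB", "AND", "AGO"),
  education_expenditure = c(4.1, 3.6, 3.2, 3.4)
)

# Step 1: attach the data to the country polygons via the ISO3 code.
map_data <- joinCountryData2Map(
  toy,
  joinCode       = "ISO3",
  nameJoinColumn = "iso_code.x"
)

# Step 2: color each country by its education expenditure.
mapCountryData(
  map_data,
  nameColumnToPlot = "education_expenditure",
  catMethod        = "fixedWidth",
  colourPalette    = "heat",
  mapTitle         = "Education expenditure (% of GDP)"
)
```

`joinCountryData2Map()` reports how many codes matched, which is a quick way to spot entities that are missing from the map.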